This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.
Load libraries:
library(tidyverse) # Data manipulation
library(data.table) # Data reading
library(obisdi) # Tools for data ingestion for OBIS
library(here) # Get paths (important!)
library(arrow) # To deal with parquet files
The checklist is downloaded from FigShare. We use the
obisdi function get_figshare to download the files and to obtain the
metadata. Because the files are large, we first check whether they are
already present, so that the data is downloaded and the metadata saved only once:
# Get the path to data/raw
raw_path <- here("data", "raw")
# See if files were already downloaded
lf <- list.files(raw_path)
if (!any(grepl("figshare", lf))) {
fig_details <- get_figshare(article_id = 7854767, download_files = TRUE,
save_meta = TRUE, path = raw_path)
}
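The download-once guard above can be sketched as a small reusable helper. This is only an illustration: `download_once` and `fake_download` are hypothetical names, not part of obisdi, and `fake_download` merely stands in for `get_figshare` so that the example runs without network access.

```r
# Hypothetical helper: call `downloader` only if no file in `dir` matches `pattern`
download_once <- function(dir, pattern, downloader) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  already_there <- any(grepl(pattern, list.files(dir)))
  if (!already_there) {
    downloader(dir)
  }
  # TRUE when a download was triggered, FALSE when it was skipped
  invisible(!already_there)
}

# Stand-in for get_figshare(): writes a small placeholder metadata file
fake_download <- function(dir) {
  writeLines("title,doi", file.path(dir, "figshare_metadata_example.csv"))
}

demo_dir <- file.path(tempdir(), "raw_demo")
unlink(demo_dir, recursive = TRUE)

first_run  <- download_once(demo_dir, "figshare", fake_download)  # downloads
second_run <- download_once(demo_dir, "figshare", fake_download)  # skipped
```

As in the original code, the guard depends on the saved metadata file containing "figshare" in its name; if that file is deleted, the download runs again.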
After the download, the details of the dataset can be read from the file data/raw/figshare_metadata_20062023.csv.
Title: A fine-tuned global distribution dataset of marine
forests
Authors: Jorge Assis, Eliza Fragkopoulou, Duarte Frade, João Neiva,
André Oliveira, David Abecasis, Silvan Faugeron, Ester A. Serrão
Date (dmy format): 19/03/2020
DOI: 10.6084/m9.figshare.7854767.v1
URL: https://figshare.com/articles/dataset/A_fine-tuned_global_distribution_dataset_of_marine_forests/7854767
First, we reduce the size of the raw data by converting it to the
Parquet format. We keep only the flagged file (databaseAll.csv), which is the
one that we will include in the OBIS database.
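raw_files <- list.files(raw_path, full.names = TRUE)
# Remove everything except the flagged dataset and the metadata. Logical indexing
# is used because a negative index built from an empty grep() result selects nothing.
file.remove(raw_files[!grepl("databaseAll\\.csv|databaseAll\\.parquet|metadata", raw_files)])
# The conversion only runs the first time this document is knitted
if (any(grepl("databaseAll\\.csv", raw_files))) {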
flagged <- fread(paste0(raw_path, "/databaseAll.csv"))
write_parquet(flagged, paste0(raw_path, "/databaseAll.parquet"))
rm(flagged)
file.remove(paste0(raw_path, "/databaseAll.csv"))
}
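The file clean-up above depends on how the regular expression and the index behave, so a small check on file names alone may be useful. The sketch below uses invented file names and deletes nothing. It illustrates two general points about base R: an unescaped `.` matches any character, and negating an empty `grep()` result selects no elements at all.

```r
# Hypothetical file names, for illustration only
raw_files <- c("data/raw/databaseAll.csv",
               "data/raw/databaseAll.parquet",
               "data/raw/figshare_metadata_20062023.csv",
               "data/raw/databaseAll_csv_notes.txt",
               "data/raw/otherFile.csv")

# Escaped dots and end anchors keep the match strict
keep_pattern <- "databaseAll\\.csv$|databaseAll\\.parquet$|metadata"
to_keep   <- raw_files[grepl(keep_pattern, raw_files)]
to_remove <- raw_files[!grepl(keep_pattern, raw_files)]

# With an unescaped dot, "databaseAll_csv_notes.txt" would also be kept
loose_match <- grepl("databaseAll.csv", "data/raw/databaseAll_csv_notes.txt")

# A negative index built from an empty grep() result returns an empty vector
nothing_selected <- raw_files[-grep("no_such_file", raw_files)]
```

Inspecting `to_remove` before passing it to `file.remove` is a cheap way to confirm that only the intended files would be deleted.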
Now we can load the parquet file containing the dataset we will work with.
dataset <- read_parquet(paste0(raw_path, "/databaseAll.parquet"))
head(dataset)
We filter the dataset to remove records that are already available on OBIS. To do so, we exclude every record whose bibliographicCitation mentions “Ocean Biogeographic Information System” (old name) or “Ocean Biodiversity Information System”; the match is case-insensitive.
dataset_filt <- dataset %>%
mutate(proc_bibliographicCitation = tolower(bibliographicCitation)) %>%
filter(!grepl("ocean biogeographic information system|ocean biodiversity information system", proc_bibliographicCitation)) %>%
select(-proc_bibliographicCitation)
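To make the behaviour of this filter concrete, here is a minimal base R sketch on a toy table. The citations are invented for illustration, and `ignore.case = TRUE` is used here in place of the `tolower` step; it is an equivalent alternative, not the code used in this document. Note that `grepl` returns `FALSE` for missing values, so records without a citation are kept.

```r
# Toy records; the citations are invented for illustration
toy <- data.frame(
  id = 1:4,
  bibliographicCitation = c(
    "Ocean Biogeographic Information System (OBIS)",
    "OCEAN BIODIVERSITY INFORMATION SYSTEM",
    "Herbarium record, hypothetical museum",
    NA
  ),
  stringsAsFactors = FALSE
)

obis_pattern <- paste(
  "ocean biogeographic information system",
  "ocean biodiversity information system",
  sep = "|"
)

# TRUE for records that cite OBIS under either name, regardless of case
from_obis <- grepl(obis_pattern, toy$bibliographicCitation, ignore.case = TRUE)

# Keep everything else, including the record with a missing citation
toy_filt <- toy[!from_obis, ]
```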
This dataset already follows the Darwin Core (DwC) standard, so no mapping is
necessary. However, we need to separate the flags into a new table, which
will contain the MeasurementOrFacts:
flags <- dataset_filt %>%
select(id, starts_with("flag"))
Now we convert the flags object to the right format:
flags_conv <- flags %>%
pivot_longer(cols = 2:4,
names_to = "measurementType",
values_to = "measurementValue") %>%
mutate(measurementValue = as.numeric(measurementValue))
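The reshaping step can be illustrated on a toy table. The sketch below assumes the three flag columns named in this document and invented values. It selects the columns with `starts_with("flag")` rather than by position, which is an alternative to `cols = 2:4` that does not depend on column order.

```r
library(dplyr)
library(tidyr)

# Toy flags table with invented values
toy_flags <- tibble(
  id = c("a", "b"),
  flagHumanCuratedDistribution = c(1, 0),
  flagMachineOnLand = c(0, 0),
  flagMachineSuitableLightBottom = c(1, 1)
)

# One row per record and flag: the flag name goes to measurementType,
# its value to measurementValue
toy_long <- toy_flags %>%
  pivot_longer(cols = starts_with("flag"),
               names_to = "measurementType",
               values_to = "measurementValue")
```

With two records and three flags, `toy_long` has six rows, each still carrying the `id` of the record it came from.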
We can check the conversion worked by tabulating the values:
cbind(data.frame(table(flags$flagHumanCuratedDistribution)),
Freq_conv = data.frame(table(
flags_conv$measurementValue[flags_conv$measurementType == "flagHumanCuratedDistribution"]
))[,2])
cbind(data.frame(table(flags$flagMachineOnLand)),
Freq_conv = data.frame(table(
flags_conv$measurementValue[flags_conv$measurementType == "flagMachineOnLand"]
))[,2])
cbind(data.frame(table(flags$flagMachineSuitableLightBottom)),
Freq_conv = data.frame(table(
flags_conv$measurementValue[flags_conv$measurementType == "flagMachineSuitableLightBottom"]
))[,2])
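The three visual checks above could also be expressed as one programmatic comparison. The following is a sketch on toy data with hypothetical column names `flagA` and `flagB`. It compares the counts per value before and after the conversion and, unlike the default `table()` call, also counts missing values through `useNA = "ifany"`.

```r
library(tidyr)

# Toy wide table with invented values, including a missing flag
wide <- data.frame(id = 1:5,
                   flagA = c(1, 0, 1, 1, NA),
                   flagB = c(0, 0, 1, 0, 1))

long <- pivot_longer(wide,
                     cols = starts_with("flag"),
                     names_to = "measurementType",
                     values_to = "measurementValue")

# For each flag, compare value counts in the wide and long tables
counts_match <- vapply(c("flagA", "flagB"), function(col) {
  original  <- table(wide[[col]], useNA = "ifany")
  converted <- table(long$measurementValue[long$measurementType == col],
                     useNA = "ifany")
  identical(as.vector(original), as.vector(converted)) &&
    identical(names(original), names(converted))
}, logical(1))
```

`all(counts_match)` is `TRUE` when every flag has the same distribution in both tables, which makes the check easy to turn into a `stopifnot` call.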
That’s all we needed to do with the data for now.
As a final step, we remove the flag columns from the occurrence
table, as they will be supplied to the IPT as MeasurementOrFact records in a
different file.
dataset_filt <- dataset_filt %>%
select(-starts_with("flag"))
And those are the final objects:
dataset_filt